In [SS82] Snir and Solworth describe a design for the
switch component of the 'Ultracomputer' [GGKMRS82]. Using
16-bit packet switching (a read request would be 2 packets
and a write request would be 4 packets), they give designs
for a 1-chip switch with 151 pins, a 2-chip switch with a
total of 230 pins, a 5-chip switch with 345 pins, and a
9-chip switch with 549 pins. All these designs provide for
combining of read, write, and fetch-and-add operations.

In  this  note  we  show  that  an  8-by-8  switch,  which  will 
perform  the  function  of  12  2-by-2  switches,  can  be  built 
using  16  59-pin  chips,  16  58-pin  chips  and  4  49-pin  chips. 
This  is  better  than  3  59-pin  chips  per  2-by-2  switch.  If 
296-pin  chips  are  available,  only  4  chips  would  be  required 
for  the  8-by-8  switch,  less  than  1/3  chip  per  2-by-2  switch. 

The  bandwidth  of  a  network  constructed  of  8-by-8 
switches  is  expected  to  be  higher  than  one  constructed  of 
2-by-2  switches,  since  there  is  less  blocking.  A  further 
difference  is  that  a  packet  moves  thru  the  8-by-8  switch  in 
one  clock  cycle,  (and  in  the  36-chip  version,  passes  thru 
only  two  chips  rather  than  three).  This  should  give  a 
speedup  of  a  factor  of  between  1.5  and  3.  These  numbers  are 
based  on  a  rough  design,  so  may  have  to  be  modified  on  the 
basis  of  a  detailed  implementation. 


1.0   INTRODUCTION


We consider here a partitioning of the Ultracomputer
switching network which permits bit-slicing. We first note
that a combinational crosspoint switch can be bit-sliced.
The number of control lines required for a 1-bit slice of an
M-by-M crosspoint switch is M*log(M), since the source of
each of M outputs requires log(M) bits to specify. Thus a
k-bit slice will require at least 2*k*M + M*log(M) pins, and
the whole crosspoint switch will require at least
M*M * (2 + (log M)/k) pins. Thus bit-slicing does not begin
to require substantially more pins till k becomes lower
than about log(M)/2. With k equal to about log(M)/2,
we get a bit-slice of 4*M*M pins.


More  specifically,  if  we  have  60-pin  chips  available, 
we  could  get  a  highly  pin-efficient  slice  of  a  2-by-2  or 
4-by-4 crosspoint, with a 2-bit slice of an 8-by-8 switch
being acceptable (32 data lines and 24 control lines). For
a 16-by-16 switch we would need 64 control lines, so chips
of 162 pins would be needed to get more data than control
lines.  Similarly,  a  32-by-32  switch  would  require  256 
control  lines,  and  a  pin-efficient  8-bit  slice  would  require 
more  than  768  pins. 
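
The line counts quoted above can be checked with a short sketch. It uses the slice formula from the text (2*k*M data lines plus M*log(M) control lines, with logarithms to base 2). The function names and the allowance of 2 supply pins (ground and power) are assumptions made for illustration only.

```python
def slice_lines(m: int, k: int) -> tuple[int, int]:
    """Return (data_lines, control_lines) for a k-bit slice of an m-by-m crosspoint."""
    log_m = m.bit_length() - 1      # log2(m), assuming m is a power of two
    data = 2 * k * m                # k input lines and k output lines per port
    control = m * log_m             # log2(m) select bits for each of the m outputs
    return data, control


def smallest_data_heavy_slice(m: int, supply_pins: int = 2) -> tuple[int, int]:
    """Smallest slice width k with more data than control lines, and its pin count.

    supply_pins=2 (ground and power) is an assumption, not a figure from the text.
    """
    k = 1
    while True:
        data, control = slice_lines(m, k)
        if data > control:
            return k, data + control + supply_pins
        k += 1


if __name__ == "__main__":
    print(slice_lines(8, 2))                # 2-bit slice of an 8-by-8 switch
    print(smallest_data_heavy_slice(16))    # 16-by-16 switch
```

Under these assumptions the 8-by-8 case gives 32 data and 24 control lines, and the 16-by-16 case needs a 3-bit slice on a 162-pin chip, in agreement with the figures above.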

Some  improvements  in  pins  per  chip  can  be  made  by 
providing  all  data  inputs  to  a  slice,  but  only  generating 
part  of  the  output.  This  permits  a  chip  implementing  an 
8-bit  slice  of  a  32-by-32  switch,  for  example,  to  be 
implemented  as  2  chips  with  16  outputs.  However,  note  that 
this  always  increases  the  total  number  of  pins. 

In the design discussed below, we consider an 8-by-8
switch. This is compatible with the proposed Ultracomputer
physical design, and also with currently available chip
packaging. It also appears to be competitive with pin
counts up to several hundred, though with substantial
increases in pin counts, 16-by-16 or larger switches may
also be feasible.


2.0   OVERALL DESIGN

The design described below uses a combinational circuit
to implement a 16-bit wide 8-by-8 crossbar switch. The
output of this switch is fed into a combining queue somewhat
similar in design to that used by [SS82]. The block diagram
is shown in figure 1.


3.0   CHIP FUNCTIONS


The block labelled "8*8 xpoint" above is a 16-bit
8-by-8 crosspoint switch which is constructed out of 8 2-bit
slices. We refer to the 2-bit slice as the "router" chip;
its pin designations are as follows:

Router - 58 pins
inputs:
di[0..7,0..1]    data input (2 bits from 8 sources)
addr[0..7,0..2]  routing input (3 bits per sink)
ground and power
outputs:
do[0..7,0..1]    data output (2 bits to 8 sinks)

This chip is a combinational circuit, whose control input
(addr) consists of 8 sets of 3 lines, with addr[i,*]
specifying which input is output on do[i] (i.e. do[i,*] =
di[addr[i,*],*]). Eight router chips are used in parallel
to create a 16-bit 8-by-8 crossbar.
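
The behaviour of the router slices can be illustrated with a small software model. This is only a sketch of the relation do[i,*] = di[addr[i,*],*] stated above; the function names and the representation of a 16-bit channel as a Python integer are assumptions made for illustration.

```python
def router_slice(di, addr):
    """One router chip: sink i receives the slice bits of source addr[i]."""
    return [di[a] for a in addr]


def crossbar16(words, addr, slice_bits=2, width=16):
    """Route 16-bit words through width/slice_bits router slices in parallel.

    words[j] is the 16-bit value on source j; addr[i] selects the source for sink i.
    Every slice receives the same addr, as in the bit-sliced design.
    """
    mask = (1 << slice_bits) - 1
    out = [0] * len(addr)
    for s in range(width // slice_bits):
        shift = s * slice_bits
        di = [(w >> shift) & mask for w in words]   # this slice's bits of each source
        do = router_slice(di, addr)
        for i, bits in enumerate(do):
            out[i] |= bits << shift                 # reassemble the sink's word
    return out


if __name__ == "__main__":
    words = [0x1000 + 0x0111 * j for j in range(8)]
    reverse = [7, 6, 5, 4, 3, 2, 1, 0]
    print([hex(w) for w in crossbar16(words, reverse)])
```

Because each slice applies the same selection to its own two bits, the eight slices together behave as a single 16-bit crossbar.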

The  routing  for  the  crossbar  switch   is   determined   by 
two   arbiter   chips,   each  of  which  provides  control  signals 
for  4  input  and  output  channels  for  the  router   chips.    Pin 
designations  for  the  arbiter  chip  are  as  follows: 
Arbiter - 48 pins
inputs:
as[0..7,0..2]    address/status input (3 bits for 8 channels)
h                which half is to be controlled
clock and reset
ground and power
outputs:
addr[0..3,0..2]  address (4 sinks)
dr[0..3]         data received (4 sources)
da[0..3]         data available (4 sinks)
[...] are required to produce the 40 control
[...] chip generating 5 control lines for k
[...] arbiters operate identically apart from
[...] (fixed) control line h is used to specify
[...]reduces the control signals for the first
[...] of the channels. For each channel, the
[...] generated by the arbiter chips are:
[...]r the (i+h*4)'th input is received by the
[...], whether data will be delivered to the
[...]; and the routing addresses for the
[...] (addr[j,i+h*4] specifies that the
[...] is the j'th input). The as[*,*] inputs to
[...]scribed later.


The output of the crosspoint switch is the input to the
8 combining queue (CQ) chips on the PE-to-MM side, and to
the 8 decombining queue chips on the MM-to-PE side. The
queue chips are similar to those described in [SS82], but
permit the arbiter chips to economise on pins by
time-multiplexing the three control lines for two purposes.
The pin designations are as follows:

Combining Queue (CQ) (PE-to-MM) - 59 pins
inputs:
di[0..15]        input data
ida              input data available
odr              output data received
clock and reset
ground and power
outputs:
qf               queue full
do[0..15]        output data
addr[0..2]       address of output
lp               last packet (same pin as addr[0])
pnc              packet next cycle (same pin as addr[1])
cdo[0..15]       combine data output
cda              combine data available

On even cycles, when the packet being transmitted on do[*]
may contain an address, as[*] will specify the address. On
odd cycles, lp (alias as[0]) is used to assert that the
current packet is the last in a message (so that the arbiter
can prepare to latch a new address), and pnc (alias as[1])
is used to indicate that there will be a packet output on
the next cycle. The line as[2] is 3-state, and is not
driven on odd cycles, so that it can be used by the A chips
to get the qf signal from the CQ chip in the succeeding
level.


The cdo[*] output from the CQ chips contains
information which specifies which messages have been
combined. This is used by the DQ chip to generate the
appropriate responses. The principle is the same as that
used in [SS82], but permits more than two messages to be
combined. This is discussed in more detail below.

The pin designations of the decombining queue (DQ) chip
are similar to those of the combining queue chip:

Decombining Queue (DQ) (MM-to-PE) - 59 pins
inputs:
di[0..15]        input data
ida              input data available
odr              output data received
cdi[0..15]       combine data input
cda              combine data available
clock and reset
ground and power
outputs:
qf               queue full
do[0..15]        output data
addr[0..2]       address of output
lp               last packet (same pin as addr[0])
pnc              packet next cycle (same pin as addr[1])

The cdo[*] output of the combining queue chip CQ[i] will be
connected to the cdi[*] input of the decombining queue
DQ[i], which will store it in its associative memory for
reconstruction of messages which have been combined.


3.1   Speed Improvements

Additional improvement might be obtained by making the
even cycles slightly longer than the odd cycles.


4.0   MULTIPLE  COMBININGS 


This  design  does  not  provide  as  many  opportunities  for 
messages  to  combine  as  the  [SS82]  design.  In  the  (worst) 
case  of  all  N=2**(3n)  processors  doing  a  simultaneous 
fetch-and-add on the same location, use of 8-by-8 switches
and [SS82]-like queues would permit at most one combine on
each input at n levels, giving a serial stream at the memory
of length about 2**(2n). If N=4096, n=4, so this is a
serial stream of length 256, which would probably be
unacceptable for many applications.

This  can  be  improved  by  permitting  multiple  combines  at 
each  level;  this  would  still  not  give  the  same  performance 
as  the  [SS82]  design,  since  inputs  to  the  switches  would 
have  to  serialise  to  at  least  length  8  in  order  to  combine 
completely. (Note that, in a later design, Solworth [So83]
proposes  a  queue  design  which  would  also  serialise  to  3n 
requests  in  the  fully  loaded  case).  A  design  which  permits 
the  combination  of  an  arbitrary  number  of  messages  at  each 
queue  is  given  in  the  following  section. 

The  [SS82]  queue  design  does  not  extend  in  an  obvious 
way  to  multiple  combinings  for  two  reasons.  First,  the 
information  about  multiple  combinings  must  be  serialised 
before  communication  to  the  decombining  queue;  and  second, 
the  queue  design  itself  separates  in  time  the  comparison  and 
the  addition  operations,  with  the  addition  occurring  just 
prior to output.

Note that though the probability of encountering the
worst case is low, the problem could still arise with a
relatively small number of requests. Suppose, for example,
that every 64th processor issued a simultaneous
fetch-and-add on the same location. These would not combine
in  the  first  two  layers,  but  would  still  serialise  to  a 
length  of  16  in  the  second  two  layers.  If  simulation  proves 
that  this  might  be  a  problem,  one  option  might  be  to  build 
the  first  few  layers  with  8  by  8  switches,  and  use  2  by  2 
switches  for  the  remainder. 
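
The serialisation figures quoted in this section (256 for the worst case and 16 for every 64th processor, both with N=4096) can be reproduced with a crude model. The sketch below is not a simulation of the actual switch. It assumes that processor p enters the first layer on line p, that line q of one layer feeds switch q // 8 of the next, that a stream arriving alone at a switch passes through unchanged, and that otherwise every message pairs off with exactly one other. All names are hypothetical.

```python
def stream_length(active, levels=4, fan_in=8):
    """Length of the serial stream at the memory under the crude model above.

    active : iterable of processor numbers issuing a request to one location
    levels : number of switch layers (n), with fan_in**levels processors in all
    """
    # streams[line] = number of messages queued on that line
    streams = {p: 1 for p in active}
    for _ in range(levels):
        by_switch = {}
        for line, length in streams.items():
            by_switch.setdefault(line // fan_in, []).append(length)
        streams = {}
        for switch, lengths in by_switch.items():
            total = sum(lengths)
            if len(lengths) == 1:
                streams[switch] = total             # nothing to combine with
            else:
                streams[switch] = (total + 1) // 2  # each message combines at most once
    return sum(streams.values())


if __name__ == "__main__":
    print(stream_length(range(4096)))          # all processors
    print(stream_length(range(0, 4096, 64)))   # every 64th processor
```

The perfect-pairing assumption is optimistic, so the model gives a lower bound on the stream length for [SS82]-like queues rather than a prediction.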


4.1   Design  For  A  Combining  Queue 


The design suggested below combines messages on input,
rather than on output as in [SS82]. The registers used
include the following:

IR[0..15]              input register
IPN[0..1]              input packet number
QR[0..qmax][0..15]     queue registers
QRPN[0..qmax][0..1]    queue register packet number
QRF[0..qmax]           queue register full
QRE[0..qmax]           queue register equal
QRC[0..qmax]           queue register carry
QE                     queue empty
QF                     queue full

Packets may be in[...] and IPN on each clock
and in the abs[...] is empty, packets may be
into some QR[i], [...] 0 or 1 (i.e. the packet
output from QR[...] output of IR and IPN are
output from IR di[...]
compared simultan[...] QR[i] and QRPN[i] and
QRE[i] is set if the operation and addresses are the same.
[...] and the packets are data for fetch-and-add
operations, and the appropriate QRE[i] is set, IR will be
added into QR[i], and the contents of IR will be output to
the DQ chip.


When a message which is the result of a combine
operation is output by the CQ chip, it is also sent to the
DQ chip. This can be done in two ways: on the cdo[*] lines
or  on  a  separate  set  of  lines.  If  the  cdo[*]  lines  are 
used,  the  simplest  solution  would  probably  be  to  prevent  the 
CQ  chip  accepting  an  incoming  message  while  the  combined 
message  was  being  output;  alternatively,  the  cdo[*]  outputs 
could  be  queued.  With  the  alternate  packaging  of  the  CQ  and 
DQ  chips  suggested  in  the  next  section,  another  set  of  lines 
connecting the CQ and DQ chips could be used.


5.0   BANDWIDTH

No precise results are reported here for the expected
bandwidth  of  a  network  constructed  of  8-by-8  switches. 
However,  a  result  due  to  Spirakis  [Sp83]  shows  that,  with  no 
combining,  the  average  queue  length  using  8-by-8  switches 
will  be  smaller  by  a  factor  of  7/4  than  that  for  2-by-2 
switches.


6.0   THE EFFECT OF HIGHER PIN COUNTS

The  design  described  above  can  also  benefit 
dramatically  from  increases  in  pin  count.  At  one  extreme,  a 
single-chip  implementation  could  be  done  in  a  580  pin  chip. 
Such an implementation would be faster than the multiple chip
implementation since there would be fewer off-chip
connections, and also some encoding and decoding could be
eliminated. A two chip version with 428 pin chips could be
obtained by splitting the PE-to-MM and MM-to-PE halves;
this  might  be  useful  if  packaging  permitted  local 
connections  between  two  chips  on  the  same  carrier. 


However, even modest increases in pin count above the
60 pins required for the 36 chip design could produce
considerable improvement. With [...]

With 188 pin chips, the crosspoints could handle 8 bits
and  do  their  own  arbitration,  and  2  double  queues  could  fit 
on  one  chip.  The  result  would  be  an  8  chip  design,  better 
than  any  design  based  on  2-by-2  switches. 

Lastly,  a  4  chip  design  would  require  296  pin  chips. 
These would comprise 2 identical 16-bit 8-by-8 crosspoint
switches,  and  two  identical  queue  chips,  each  of  which  would 
handle  4  queues. 


7.0   32  BIT  PACKETS 

This  design  generalises  fairly  well  to  larger  packets. 
The  router  and  arbiter  chips  remain  much  the  same,  but  the 
queues get more complicated, and require more pins. With 32
inputs  and  outputs  per  queue,  107  pins  would  be  needed. 
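
The 107-pin figure follows from the combining queue pin list given earlier if the three 16-bit buses are widened to 32 bits. The sketch below tallies the pins under that reading. The function name is hypothetical, and counting clock, reset, ground and power as one pin each is an assumption.

```python
def queue_pins(width: int) -> int:
    """Pin count of a combining queue chip with width-bit data paths."""
    data = 3 * width   # di, do and cdo buses
    handshake = 4      # ida, odr, qf, cda
    address = 3        # addr[0..2]; lp and pnc share the addr[0] and addr[1] pins
    supply = 4         # clock, reset, ground, power (assumed one pin each)
    return data + handshake + address + supply


if __name__ == "__main__":
    print(queue_pins(16), queue_pins(32))
```

The same tally applies to the decombining queue, which has cdi in place of cdo and cda as an input rather than an output.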

An  alternative  way  of  implementing  a  32  bit  queue  would 
be  to  use  2  chips  with  32  bit  inputs  and  16  bit  outputs 
(actually  one  with  a  17  bit  output  and  one  with  a  15  bit 
output;  the  latter  would  generate  the  control  signals 
also).  This  would  permit  a  32  bit  queue  on  a  72  pin  chip. 
The  resulting  design  would  have  4  48  pin  chips,  32  58  pin 
chips,  and  16  72  pin  chips,  only  slightly  more  pins  than  two 
16-bit switches.

An  8  chip  design  could  be  done  with  312  pins  per  chip. 


8.0   OTHER BIT-SLICE METHODS